In the twilight zone of protein sequence homology: do protein language models learn protein structure?

Abstract
Motivation
Protein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent.
Results
We address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the “twilight zone” of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak. We believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules.
Availability and implementation
All code, data, and models are made publicly available.

Table 3. Data statistics for the SCOPe- and SCOP2-derived datasets following preprocessing at various percent sequence identity thresholds. Note that for SCOP2 an increased threshold does not always yield more Superfamilies or Folds, because the clusters were re-evaluated for each threshold.

Further Model Details
Here we provide more details for each baseline.
Random: Because the random baseline differs little across thresholds, and the aim is only to check whether the PLMs outperform it (which they do), we report its performance metrics averaged across all considered sequence identity thresholds, shown as the red dotted line in Figs. 1 (A) and (B).
HHblits: We compute match scores between pairs of proteins using HHblits from the HH-Suite software package. This involves computing multiple-sequence alignments between the sequences in our protein database and training hidden Markov model "profiles" that can be compared with each other to obtain a match score for each pair of proteins. For each of these steps we used the default recommended HH-Suite settings, the only deviation being a maximum memory limit of 3.4 GB per process when computing the multiple-sequence alignments. When ranking potential matches, we rely on the "E-value" rather than the probability score, as in (Rives et al., 2021a). Note that Rives et al. (2021a) observed even better performance when the number of iterations used to build the multiple-sequence alignment was increased from 2 to 3. Due to resource constraints, we retained the default value of 2 for this study.
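As an illustration of the ranking step, the following minimal sketch orders candidate matches for one query by ascending E-value (a lower E-value indicates a more significant match). It assumes the pairwise E-values have already been parsed from the HHblits output into a dictionary; the function name `rank_by_evalue` and the protein identifiers and values are hypothetical and are not taken from the study's code.

```python
from typing import Dict, List, Tuple


def rank_by_evalue(hits: Dict[str, float], query_id: str) -> List[Tuple[str, float]]:
    """Rank candidate matches for one query by ascending E-value.

    `hits` maps a database protein ID to the E-value of its match with the
    query. The query itself is excluded from the ranking.
    """
    candidates = [(pid, e) for pid, e in hits.items() if pid != query_id]
    # Smaller E-value = fewer matches of this quality expected by chance.
    return sorted(candidates, key=lambda item: item[1])


# Hypothetical E-values for query "q1" (illustrative numbers only).
example_hits = {"q1": 0.0, "p1": 3.2e-5, "p2": 1.4, "p3": 7.0e-12, "p4": 0.08}
ranking = rank_by_evalue(example_hits, "q1")
for protein_id, evalue in ranking:
    print(protein_id, evalue)
```

Ranking by probability score instead would sort in descending order; the sketch uses the E-value because that is the criterion stated above.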

Effect of Weighting Results by Superfamily
Comparing weighted and unweighted performance scores across Superfamily or Fold classes reveals minimal deviations. Söding and Remmert (2011) previously noted that the number of homologous pairs scales as the square of the number of members, so large Superfamilies would have a dominant influence on the AUROC analysis. Following this standard practice, we compare weighted and unweighted performance metrics for the Fold- and Superfamily-level remote homology detection tasks. In Figure 3, we present the unweighted results for Superfamily-level (left panel) and Fold-level (right panel) remote homology detection. Comparing these with the weighted Superfamily and Fold results (top-left panels of Figure 1 (A) and (B), respectively), we find that applying weights has minimal impact. Nevertheless, we consider the weighted results to be our main findings in this manuscript, because applying weights mitigates bias towards classes with a large number of examples.
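The effect of class size can be sketched as follows. The exact weighting scheme is not spelled out in this section, so the example assumes one common choice: every class (Superfamily or Fold) contributes equally to the final score, regardless of how many queries it contains. The function names and the toy scores are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def homologous_pairs(n_members: int) -> int:
    """Number of unordered within-class pairs; grows as n squared."""
    return n_members * (n_members - 1) // 2


def unweighted_mean(scores: Sequence[float]) -> float:
    """Plain average over all queries; large classes dominate."""
    return sum(scores) / len(scores)


def class_weighted_mean(scores: Sequence[float], labels: Sequence[str]) -> float:
    """Average the per-class means so each class counts once (assumed scheme)."""
    per_class: Dict[str, List[float]] = defaultdict(list)
    for score, label in zip(scores, labels):
        per_class[label].append(score)
    class_means = [sum(v) / len(v) for v in per_class.values()]
    return sum(class_means) / len(class_means)


# Toy per-query scores: a large class that is easy and a small class that is hard.
labels = ["big"] * 8 + ["small"] * 2
scores = [0.9] * 8 + [0.5] * 2

print(homologous_pairs(8), homologous_pairs(2))  # pair counts per class
print(unweighted_mean(scores))                   # pulled towards the large class
print(class_weighted_mean(scores, labels))       # both classes count equally
```

In this toy case the two averages differ noticeably because the classes differ in both size and difficulty; when the two averages are close, as reported above, large classes are not driving the result.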

SCOP2 Results
Here we report model performance on the SCOP2-based datasets. Figure 4 shows the overall performance comparison. Detecting Fold-level remote homologs in the SCOP2 dataset remains highly difficult. No PLM exceeded a Hit@10 score of 30% on datasets derived from SCOP2, and accuracy dropped below 20% for Hit@1. Regarding the datasets, SCOPe encompasses fewer Folds per class (288, 173, 140, 393, and 98 Folds) than SCOP2 (461, 240, 165, 519, and 104) at a 95% sequence identity threshold; SCOP2 has more Folds in every class, up to about 1.6 times as many. Moreover, we observe minimal variance in homology and remote homology detection performance on the SCOP2 datasets compared with those derived from SCOPe. Both the distribution of these datasets and the performance scores of the models suggest that SCOP2-derived Fold detection presents an even greater challenge for all protein language models across the sequence similarity thresholds, regardless of the underlying pre-training datasets and architectural biases of the models.
The PLMs that we considered still lag behind in detecting remote homologs at the Fold level when compared with their performance at the Superfamily level, as illustrated in Figure 1 (A) and (B) across various metrics. While three out of seven PLMs marginally surpass the 0.2 threshold in Hit@10, none of the models achieves such performance in Hit@1, indicating the persistent challenge in zero-shot detection of remote Fold homologs for future PLMs developed solely from sequence information.
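For readers unfamiliar with the metric, the sketch below shows one way a Hit@k score can be computed in a zero-shot retrieval setting: each protein is used as a query, all other proteins are ranked by the cosine similarity of their embeddings, and the query counts as a hit if any of the top k neighbours shares its Fold label. This is a simplified illustration under stated assumptions; the choice of cosine similarity, the function names, and the toy embeddings are hypothetical, and the filtering that defines *remote* homologs (for example, excluding members of the same Superfamily) is omitted.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hit_at_k(embeddings: List[Sequence[float]], labels: List[str], k: int) -> float:
    """Fraction of queries with at least one same-label protein in the top k."""
    hits = 0
    for i, query in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        ranked = sorted(others, key=lambda j: cosine(query, embeddings[j]), reverse=True)
        if any(labels[j] == labels[i] for j in ranked[:k]):
            hits += 1
    return hits / len(embeddings)


# Toy 2-D embeddings; the last protein has label "a" but lies closer to the "b" group.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.4, 0.6]]
labels = ["a", "a", "b", "b", "a"]

for k in (1, 2, 3):
    print(k, hit_at_k(embeddings, labels, k))
```

The last toy protein illustrates why Hit@1 is stricter than Hit@10: its nearest neighbours belong to the wrong class, so it only registers as a hit once k is large enough to reach a correct match.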
In Figure 4 (C) and (D) we also show SCOP2 remote homology results when choosing representative sequences randomly instead of relying on the "representative" sequence recommended by CD-HIT. These can be compared with the results shown in Figure 4 (A) and (B), which rely on the CD-HIT-provided representative sequence for each cluster.
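The two selection strategies can be sketched as follows. The example assumes the usual CD-HIT `.clstr` text layout, in which each cluster starts with a `>Cluster N` line and the representative member's line ends with `*`; the helper names `parse_clstr` and `pick_representatives`, the fixed seed, and the sequence names are hypothetical and are not taken from the study's code.

```python
import random
from typing import Dict, List, Tuple


def parse_clstr(text: str) -> Dict[str, List[Tuple[str, bool]]]:
    """Parse CD-HIT .clstr text into {cluster: [(sequence name, is_representative)]}."""
    clusters: Dict[str, List[Tuple[str, bool]]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]
            clusters[current] = []
        elif current is not None:
            # Member line, e.g. "0\t120aa, >seqA... *"
            name = line.split(">", 1)[1].split("...", 1)[0]
            clusters[current].append((name, line.endswith("*")))
    return clusters


def pick_representatives(
    clusters: Dict[str, List[Tuple[str, bool]]], mode: str = "cdhit", seed: int = 0
) -> Dict[str, str]:
    """Choose one sequence per cluster: CD-HIT's representative or a random member."""
    rng = random.Random(seed)  # fixed seed so the random choice is reproducible
    chosen: Dict[str, str] = {}
    for cluster_id, members in clusters.items():
        if mode == "random":
            chosen[cluster_id] = rng.choice(members)[0]
        else:
            chosen[cluster_id] = next(name for name, is_rep in members if is_rep)
    return chosen


example = (
    ">Cluster 0\n"
    "0\t120aa, >seqA... *\n"
    "1\t118aa, >seqB... at 97.46%\n"
    ">Cluster 1\n"
    "0\t95aa, >seqC... *\n"
)
clusters = parse_clstr(example)
print(pick_representatives(clusters, mode="cdhit"))
print(pick_representatives(clusters, mode="random", seed=1))
```

Comparing results under both modes, as in Figure 4, indicates whether conclusions depend on which cluster member happens to be chosen.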

Fig. 3. Unweighted performance scores across Superfamily or Fold classes show small deviations. Compare with the weighted scores shown in Figures 1 (A) and (B).
The actual values of the random baseline for each threshold are shown in Table 4.

Table 4. Metrics obtained for Superfamily-level remote homology using the random-embedding baseline.

Table 5. Superfamily-level remote homology prediction results using SCOPe in tabular form. The same information is shown in Figure 1 (A).